Here we compare two models: fit1, Price ~ TDate + Age + Stores + Latitude, and fit5, Price ~ TDate + Age + Metro + Latitude.
fit5 <- lm(Price ~ TDate + Age + Metro + Latitude)
summary(fit1)
##
## Call:
## lm(formula = Price ~ TDate + Age + Stores + Latitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.620 -5.601 -0.714 4.207 80.465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.742e+04 3.524e+03 -4.944 1.12e-06 ***
## TDate 3.613e+00 1.686e+00 2.143 0.0327 *
## Age -3.020e-01 4.178e-02 -7.227 2.44e-12 ***
## Stores 1.929e+00 1.801e-01 10.712 < 2e-16 ***
## Latitude 4.078e+02 4.278e+01 9.534 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.654 on 409 degrees of freedom
## Multiple R-squared: 0.5015, Adjusted R-squared: 0.4966
## F-statistic: 102.8 on 4 and 409 DF, p-value: < 2.2e-16
summary(fit5)
##
## Call:
## lm(formula = Price ~ TDate + Age + Metro + Latitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.218 -5.269 -0.700 4.433 70.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.767e+04 3.359e+03 -5.262 2.30e-07 ***
## TDate 5.570e+00 1.619e+00 3.440 0.000642 ***
## Age -2.530e-01 4.001e-02 -6.323 6.71e-10 ***
## Metro -5.764e-03 4.493e-04 -12.829 < 2e-16 ***
## Latitude 2.607e+02 4.569e+01 5.705 2.23e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.225 on 409 degrees of freedom
## Multiple R-squared: 0.5448, Adjusted R-squared: 0.5403
## F-statistic: 122.4 on 4 and 409 DF, p-value: < 2.2e-16
From the summary tables, fit1 has a multiple R-squared of 0.5015, whereas fit5 has a multiple R-squared of 0.5448. Both models have the same response and the same number of predictors, so the values are directly comparable. Since fit5 (Price ~ TDate + Age + Metro + Latitude) explains more of the variance in house price, we will use this model. In addition, the effect of TDate is more strongly significant in fit5 (p = 0.000642) than in fit1 (p = 0.0327).
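The same comparison can be made programmatically instead of reading the numbers off two printed summaries. The following is a minimal sketch on simulated data. The data frame `sim`, its columns, and the helper `fit_stats` are assumptions introduced for illustration and are not part of the RealEstate analysis.

```r
# Simulated stand-in data; these columns are NOT the RealEstate variables.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x2 + rnorm(n)                      # a noisier relative of x2
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

fit_a <- lm(y ~ x1 + x3, data = sim)     # uses the noisy predictor
fit_b <- lm(y ~ x1 + x2, data = sim)     # uses the predictor that generated y

# Hypothetical helper: collect the fit statistics that summary() prints, plus AIC.
fit_stats <- function(fit) {
  s <- summary(fit)
  c(r.squared     = s$r.squared,
    adj.r.squared = s$adj.r.squared,
    sigma         = s$sigma,
    AIC           = AIC(fit))
}

comparison <- rbind(fit_a = fit_stats(fit_a), fit_b = fit_stats(fit_b))
print(round(comparison, 4))
```

Because both models here have the same response and the same number of predictors, R-squared, adjusted R-squared, residual standard error, and AIC should all rank them the same way.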
Transforming the Predictors
Since the predictor Age contains values equal to 0, we cannot apply a power or log transformation to it directly. Therefore, we first shift Age by a small constant so that every value is strictly positive.
RealEstate$Age <- with(RealEstate, (Age + 0.01))
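To illustrate why the shift is needed, here is a small sketch using a made-up vector of ages. The values in `age` are hypothetical, not taken from the RealEstate data. The constant 0.01 matches the one used above.

```r
# Hypothetical ages in years, including exact zeros.
age <- c(0, 0, 1.1, 5, 13.3, 32)

# The log of zero is -Inf, which breaks any model that uses log(age).
print(log(age))

# Shift by a small constant so that every value is strictly positive.
shift <- 0.01
age_shifted <- age + shift
stopifnot(all(age_shifted > 0))

# Both transformations are now finite for every observation.
print(log(age_shifted))
print(sqrt(age_shifted))
```

The choice of constant has more influence on a log transformation than on a square root, since log changes rapidly near zero while sqrt does not.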
Now we test to see if we need to power transform:
RE.pt <- powerTransform(cbind(TDate, Age, Metro, Latitude) ~ 1, RealEstate)
summary(RE.pt)
## bcPower Transformations to Multinormality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## TDate 3.0000 1.0 -503.9213 509.9213
## Age 0.5469 0.5 0.4751 0.6187
## Metro 0.0780 0.0 -0.0020 0.1581
## Latitude 3.0000 1.0 -147.6445 153.6446
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0 0 0) 451.5795 4 < 2.22e-16
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1 1 1) 557.2776 4 < 2.22e-16
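Here the Wald confidence intervals for TDate and Latitude contain 1, so those predictors need no transformation. The intervals for Age and Metro do not contain 1, so we test the rounded powers suggested in the table: 0.5 (square root) for Age and 0 (log) for Metro.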
testTransform(RE.pt, lambda = c(1, 0.5, 0, 1))
## LRT df pval
## LR test, lambda = (1 0.5 0 1) 5.530874 4 0.23703
The p-value of 0.237 gives no evidence against lambda = (1, 0.5, 0, 1), so we take the square root of Age and the log of Metro:
sqrt_Age <- sqrt(Age)
RE_trsf = with(RealEstate, data.frame(Price, TDate, sqrt(Age), log(Metro), Latitude))
pairs(RE_trsf)

The scatterplot matrix now shows approximately linear relationships between each of the predictors and the response.
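The powers 0.5 and 0 correspond to a square root and a log because both are members of the scaled power (Box-Cox) family, which is what the bcPower label in the output refers to. The sketch below writes that family out in base R. The function name `bc_power` and the vector `x` are assumptions made for illustration.

```r
# Scaled power family: (x^lambda - 1) / lambda, with log(x) as the lambda = 0 case.
# Defined only for strictly positive x.
bc_power <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

x <- c(0.5, 1, 2, 10, 100)   # hypothetical positive values

# lambda = 0.5 is a linear rescaling of sqrt(x), so using sqrt(x) in a
# regression is equivalent up to the intercept and slope.
print(all.equal(bc_power(x, 0.5), 2 * (sqrt(x) - 1)))

# As lambda approaches 0 the family approaches log(x).
print(max(abs(bc_power(x, 1e-6) - log(x))))

# lambda = 1 only shifts x by a constant, i.e. no transformation.
print(all.equal(bc_power(x, 1), x - 1))
```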
Transforming the Response
Now that we have found transformations for our predictors, we check whether we also need to transform our response.
boxCox(fit5)

Since 0 lies within the confidence interval for lambda shown in the Box-Cox plot, we choose lambda = 0 and log transform our response Price. Hence, our model is log(Price) ~ TDate + sqrt(Age) + log(Metro) + Latitude.
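The interval read off a Box-Cox plot can also be computed numerically. The following sketch uses `boxcox` from the MASS package rather than the `boxCox` call above, and runs on simulated data whose true transformation is a log. The data frame `sim` and all variable names are assumptions made for illustration.

```r
library(MASS)  # MASS is a recommended package shipped with standard R installations

# Simulated data in which log(y) is linear in x, so the true lambda is 0.
set.seed(2)
n <- 300
sim <- data.frame(x = runif(n, 0, 2))
sim$y <- exp(1 + 0.8 * sim$x + rnorm(n, sd = 0.3))

# y = TRUE stores the response in the fit so boxcox() does not need to refit.
fit <- lm(y ~ x, data = sim, y = TRUE)

# Profile log-likelihood over a grid of lambda values, without plotting.
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)

# Point estimate: the grid value that maximises the profile log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[bc$y >= cutoff])

zero_in_ci <- ci[1] <= 0 && 0 <= ci[2]
print(c(lambda_hat = lambda_hat, lower = ci[1], upper = ci[2]))
print(zero_in_ci)
```

If `zero_in_ci` is TRUE, a log transformation of the response is consistent with the data, which is the same reasoning applied to the plot.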
final.fit <- lm(log(Price) ~ TDate + sqrt(Age) + log(Metro) + Latitude)
summary(fit5)
##
## Call:
## lm(formula = Price ~ TDate + Age + Metro + Latitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.218 -5.269 -0.700 4.433 70.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.767e+04 3.359e+03 -5.262 2.30e-07 ***
## TDate 5.570e+00 1.619e+00 3.440 0.000642 ***
## Age -2.530e-01 4.001e-02 -6.323 6.71e-10 ***
## Metro -5.764e-03 4.493e-04 -12.829 < 2e-16 ***
## Latitude 2.607e+02 4.569e+01 5.705 2.23e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.225 on 409 degrees of freedom
## Multiple R-squared: 0.5448, Adjusted R-squared: 0.5403
## F-statistic: 122.4 on 4 and 409 DF, p-value: < 2.2e-16
summary(final.fit)
##
## Call:
## lm(formula = log(Price) ~ TDate + sqrt(Age) + log(Metro) + Latitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57902 -0.10462 0.01289 0.11008 0.96421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.327e+02 7.539e+01 -8.392 7.87e-16 ***
## TDate 1.794e-01 3.664e-02 4.897 1.40e-06 ***
## sqrt(Age) -4.780e-02 6.657e-03 -7.180 3.31e-12 ***
## log(Metro) -2.050e-01 1.053e-02 -19.472 < 2e-16 ***
## Latitude 1.108e+01 9.356e-01 11.840 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.208 on 409 degrees of freedom
## Multiple R-squared: 0.7218, Adjusted R-squared: 0.719
## F-statistic: 265.2 on 4 and 409 DF, p-value: < 2.2e-16
Comparing our fit before and after transforming Price, Age, and Metro, we see a large increase in the multiple R-squared. Before, it was 0.5448; after taking log(Price), sqrt(Age), and log(Metro), it is 0.7218. Because the two values are computed on different response scales (Price versus log(Price)), they are not strictly comparable, but the size of the increase indicates a large improvement in our model after the transformations.
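One rough way to put the two models on the same footing is to back-transform the fitted values of the log model and compute R-squared on the original scale. The sketch below does this on simulated data. The data frame `sim`, the variable `price`, and the helper `r2_original_scale` are assumptions made for illustration, and the numbers it produces say nothing about the RealEstate fits.

```r
# Simulated stand-in data; NOT the RealEstate data.
set.seed(3)
n <- 300
sim <- data.frame(x = runif(n, 0, 2))
sim$price <- exp(1 + 0.8 * sim$x + rnorm(n, sd = 0.3))

raw_fit <- lm(price ~ x, data = sim)
log_fit <- lm(log(price) ~ x, data = sim)

# Hypothetical helper: R-squared on the original scale for any predictions.
r2_original_scale <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

# For the untransformed model this reproduces the usual multiple R-squared.
r2_raw <- r2_original_scale(sim$price, fitted(raw_fit))

# exp() of the fitted log values estimates the conditional median of price,
# not its mean, so this is a rough comparison only.
r2_log_back <- r2_original_scale(sim$price, exp(fitted(log_fit)))

# R-squared as reported by summary(), measured on the log scale.
r2_log_scale <- summary(log_fit)$r.squared

print(round(c(raw = r2_raw, log_back = r2_log_back, log_scale = r2_log_scale), 4))
```

Comparing `r2_raw` with `r2_log_back` keeps the response scale fixed, whereas comparing `r2_raw` with `r2_log_scale` mixes two scales.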